[Dataset Performance] Add num workers on dataset processing - labels, tokenization #1189

Merged: 3 commits, Feb 25, 2025

horheynm
Collaborator

SUMMARY:

  • Add preprocessing_num_workers so that dataset processing (tokenization and adding labels) runs in parallel in the 2:4 example.
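
A minimal sketch of how such a setting could be forwarded to the Hugging Face datasets library, assuming the processing steps go through Dataset.map (the "(num_proc=8)" suffix in the After progress bars is how that library labels parallel maps). The helper name build_map_kwargs is hypothetical and is not taken from this PR; num_proc and desc are existing Dataset.map parameters.

```python
from typing import Any, Dict, Optional


def build_map_kwargs(preprocessing_num_workers: Optional[int], desc: str) -> Dict[str, Any]:
    """Build keyword arguments for datasets.Dataset.map (hypothetical helper)."""
    if preprocessing_num_workers is not None and preprocessing_num_workers < 1:
        raise ValueError("preprocessing_num_workers must be a positive integer or None")
    return {
        "num_proc": preprocessing_num_workers,  # None keeps single-process behavior
        "desc": desc,  # progress-bar label, e.g. "Tokenizing"
    }


# Assumed usage (needs the `datasets` package; `tokenize_fn` is a placeholder):
#   tokenized = raw_dataset.map(tokenize_fn, **build_map_kwargs(8, "Tokenizing"))
```

Leaving the value as None keeps the single-process behavior, so the setting can stay opt-in.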

Before:
Tokenizing: 371.12 examples/s,
Adding labels: 1890.18 examples/s,
Tokenizing: 333.39 examples/s

Tokenizing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:34<00:00, 371.12 examples/s]
Adding labels: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:06<00:00, 1890.18 examples/s]
Tokenizing:   9%|█████████▌                                                                                                     | 22077/256032 [00:59<11:41, 333.39 examples/s

After (num_proc=8):
Tokenizing: 2703.93 examples/s,
Adding labels: 5524.98 examples/s,
Tokenizing: 2925.98 examples/s

Tokenizing (num_proc=8): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:04<00:00, 2703.93 examples/s]
Adding labels (num_proc=8): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:02<00:00, 5524.98 examples/s]
Tokenizing (num_proc=8): 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 256032/256032 [01:27<00:00, 2925.98 examples/s]
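
For a quick read of the numbers above, the speedup per step can be computed directly from the reported throughputs. This is only arithmetic on the figures in this PR; the dictionary keys are illustrative labels, and the third "before" figure was captured at 9% progress, so that ratio is approximate.

```python
# Throughput in examples/s, copied from the Before/After figures above.
before = {
    "Tokenizing (12802 examples)": 371.12,
    "Adding labels (12802 examples)": 1890.18,
    "Tokenizing (256032 examples)": 333.39,  # partial run, measured at 9%
}
after = {
    "Tokenizing (12802 examples)": 2703.93,
    "Adding labels (12802 examples)": 5524.98,
    "Tokenizing (256032 examples)": 2925.98,
}


def speedups(before: dict, after: dict) -> dict:
    """Return the after/before throughput ratio for each step, rounded to 2 places."""
    return {name: round(after[name] / before[name], 2) for name in before}


for name, factor in speedups(before, after).items():
    print(f"{name}: {factor:.2f}x")
# Tokenizing (12802 examples): 7.29x
# Adding labels (12802 examples): 2.92x
# Tokenizing (256032 examples): 8.78x
```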

TEST PLAN:

  • Pass existing tests

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite. Please add the label only once the PR is code complete and local testing has been performed.

@horheynm added the "ready" label (When a PR is ready for review) on Feb 25, 2025
Collaborator

@kylesayrs left a comment

Nice, thanks

@dsikka enabled auto-merge (squash) on February 25, 2025, 20:33
@dsikka merged commit 77e4f4c into main on Feb 25, 2025
7 checks passed
@dsikka deleted the num-proc-dataset branch on February 25, 2025, 21:40